Supporting Text Retrieval by Typographical Term Weighting

Authors

  • Lars Werner
  • Stefan Böttcher
Abstract

Text documents stored in information systems usually consist of more information than the pure concatenation of words, i.e., they also contain typographic information. Because conventional text retrieval methods evaluate only the word frequency, they miss the information provided by typography, e.g., regarding the importance of certain terms. In order to overcome this weakness, we present an approach which uses the typographical information of text documents and show how this improves the efficiency of text retrieval methods. Our approach uses weighting of typographic information in addition to term frequencies for separating relevant information in text documents from the noise. We have evaluated our approach on the basis of automated text classification algorithms. The results show that our weighting approach achieves very competitive classification results using at most 30% of the terms used by conventional approaches, which makes our approach significantly more efficient.
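The idea of weighting terms by typography in addition to their frequency can be sketched as follows. This is only an illustration, not the authors' implementation: the abstract does not say which typographic features are used or how they are weighted, so the style names, the multipliers in `STYLE_WEIGHTS`, and the helpers `typographic_weights` and `top_terms` are all assumptions. The `fraction=0.3` default merely echoes the "at most 30% of the terms" figure from the abstract.

```python
import math
from collections import Counter

# Hypothetical multipliers per typographic style; the abstract gives no values.
STYLE_WEIGHTS = {"plain": 1.0, "italic": 1.5, "bold": 2.0, "heading": 3.0}


def typographic_weights(tokens):
    """Sum a style-dependent weight for every occurrence of a term.

    tokens: iterable of (term, style) pairs.
    Returns a dict mapping each lower-cased term to its weighted frequency.
    With every style set to "plain" this reduces to ordinary term frequency.
    """
    scores = Counter()
    for term, style in tokens:
        # Unknown styles fall back to the plain-text weight.
        scores[term.lower()] += STYLE_WEIGHTS.get(style, 1.0)
    return dict(scores)


def top_terms(scores, fraction=0.3):
    """Keep the highest-scoring fraction of terms (at least one term)."""
    keep = max(1, math.ceil(len(scores) * fraction))
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:keep]]


if __name__ == "__main__":
    doc = [
        ("Retrieval", "heading"),
        ("retrieval", "plain"),
        ("the", "plain"),
        ("the", "plain"),
        ("the", "plain"),
        ("typography", "bold"),
        ("noise", "plain"),
    ]
    scores = typographic_weights(doc)
    print(scores)
    print(top_terms(scores))
```

In this toy document, the term that appears once in a heading and once in plain text outscores a function word that appears three times in plain text, which is the kind of signal plain frequency counting would miss.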

Similar resources

The Effect of Term Importance Degree on Text Retrieval

Various approaches to index term-weighting have been investigated. In fact, term-weighting is an indispensable process for document ranking in most retrieval systems. As well actual information retrieval systems have to deal with explosive growth of documents of various sizes and terms of various frequencies because an appropriate term-weighting scheme has a crucial impact on the overall perfor...

A new term-weighting scheme for text classification using the odds of positive and negative class probabilities

Text classification is a core technique for text mining and information retrieval. It has been applied to many applications in many different research and industrial areas. Term weighting schemes have to assign an appropriate weight to each term to obtain a high text classification performance. Although term weighting is one of the important modules for text classification, and text classificat...

An Integrated and Improved Approach to Terms Weighting in Text Classification

Traditional text classification methods utilize term frequency (tf) and inverse document frequency (idf) as the main method for information retrieval. Term weighting has been applied to achieve high performance in text classification. Although TFIDF is a popular method, it is not using class information. This paper provides an improved approach for supervised weighting in the TFIDF model. The t...

Text Information Retrieval Approach to Music Information Retrieval

This MIREX submission for symbolic music similarity task adopts textual information retrieval methodology in the process of music information retrieval. The main contribution of this approach is to utilize well established term weighting methods for text retrieval and check their suitability for music data. We use a simple feature extraction method, so that the performance of an algorithm depen...

Probability-Based Chinese Text Processing and Retrieval

We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results tha...

Journal:
  • IJIIT

Volume 3

Publication date: 2007